ChaosAPI: Detecting Flaky Tests by Controlling Nondeterministic API Behavior

This website hosts ChaosAPI, the artifact accompanying our paper. ChaosAPI is a Java agent that perturbs selected JDK behaviors (e.g., time, randomness, sleep, sockets, and common time-bounded waits) to help surface flaky tests without modifying application code. The package includes the agent binary and minimal scripts for building and running on standard JVMs. For quick use, build with Maven and attach the agent via -javaagent:...; configuration is supplied as comma-separated key=value pairs.

Detailed usage instructions are in our source code repository:

https://github.com/Yhcrown/ChaosAPI

Quick Start

Build

mvn -DskipTests package

The build produces the agent at target/chaos-agent.jar.

Run

java -javaagent:"/path/to/chaos-agent.jar=packages=com.example.app;com.foo,randomStrategy=FACTOR,randomValue=20,timeStrategy=SAME" -jar your-app.jar

Quote the agent argument as shown: in most shells an unquoted ; ends the command.
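
When the agent is launched from a script, it can be convenient to assemble the option string programmatically and pass the command as an argument list, so that no shell ever interprets it. The following is a minimal Python sketch, assuming only the syntax visible in the command above (key=value pairs joined by `,`, package prefixes joined by `;`). The helpers `agent_option` and `build_command` are hypothetical names introduced for illustration and are not part of ChaosAPI.

```python
def agent_option(agent_jar, options):
    """Build a -javaagent option from a dict of settings.

    Assumption: pairs are joined by ',' and multiple package
    prefixes are joined by ';', as in the example command.
    """
    pairs = []
    for key, value in options.items():
        if isinstance(value, (list, tuple)):
            value = ";".join(value)  # e.g. several package prefixes
        pairs.append(f"{key}={value}")
    return f"-javaagent:{agent_jar}=" + ",".join(pairs)


def build_command(agent_jar, app_jar, options):
    """Return the java invocation as an argument list (no shell involved)."""
    return ["java", agent_option(agent_jar, options), "-jar", app_jar]


cmd = build_command(
    "/path/to/chaos-agent.jar",
    "your-app.jar",
    {
        "packages": ["com.example.app", "com.foo"],
        "randomStrategy": "FACTOR",
        "randomValue": 20,
        "timeStrategy": "SAME",
    },
)
print(cmd)
# To launch the JVM: import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` hands each element to the JVM unchanged, so no quoting is needed.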

Experiment Results

We provide all analysis scripts and the manually labeled results here:

Artifact-link

The artifact includes the following components:

failures_for_analyze.zip
Archive of all failures and their failure messages, automatically extracted. Directory layout:
.../failures_for_analyze/{project_name}/{test_leaf}/{config}/
Each config directory contains the corresponding failure message file and, when available, the original message from the FlakeFlagger dataset.
Note: some messages may appear under an unexpected test directory if the failure occurred outside the test class (e.g., in suite setup or initialization).
failure_analyze.py
Analysis script that ingests failures_for_analyze and computes the distribution of failures and message-level matches across strategy configurations.
manual_label.json
Per-project ground truth with four entries:
- TTP: Manually confirmed (or automatically message-matched) flaky tests that are also present in the FlakeFlagger dataset.
- FTP: Manually confirmed flaky tests that are not in FlakeFlagger.
- FN: Flaky tests in FlakeFlagger that our tool did not report (false negatives).
- FP: Tests reported by our tool but confirmed not flaky (false positives).
matches_config_to_original.json
For each strategy configuration, lists the reported flaky tests whose failure messages exactly match the originals in FlakeFlagger.
matches_config_to_original.csv
CSV version of the above for convenient tabular processing.
mvn_time_per_project.csv
End-to-end test time per configuration for each project (used in RQ4).
mvn_time_summary.csv
Average end-to-end test time per configuration across projects (used in RQ4).
overlap_pair_tests.json
For each pair of strategy configurations, the set of reported flaky tests that share the same normalized failure message.
per_project_failing_tests.csv
For each project, the set of flaky tests revealed by re-running.
project_dict.json
Project-level summary: enumerates all test names, identifies flaky tests, and records the tool’s reporting outcome for each project.
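
The directory layout of failures_for_analyze described above can be traversed with a few lines of Python. This is a minimal sketch, not the logic of failure_analyze.py: it assumes the archive has been unpacked and follows the {project_name}/{test_leaf}/{config}/ layout exactly, and the function name `count_failures_by_config` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_failures_by_config(root):
    """Count failure directories per (project, config).

    Assumes the layout root/{project_name}/{test_leaf}/{config}/.
    Plain files at the config level are ignored.
    """
    counts = Counter()
    for config_dir in Path(root).glob("*/*/*"):
        if config_dir.is_dir():
            project = config_dir.parent.parent.name
            counts[(project, config_dir.name)] += 1
    return counts


if __name__ == "__main__":
    # Path to the unpacked archive; adjust as needed.
    counts = count_failures_by_config("failures_for_analyze")
    for (project, config), n in sorted(counts.items()):
        print(project, config, n)
```

Because some messages may be filed under an unexpected test directory, counts derived this way are per directory and are not necessarily per distinct failing test.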

This package is intended to make our analyses reproducible and to facilitate downstream evaluation or replication.
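
As one example of downstream evaluation, the four label categories in manual_label.json can be turned into per-project summary ratios. The sketch below rests on two assumptions that should be checked against the actual file: that each project maps to an object whose TTP, FTP, FN, and FP entries are lists of test names, and that precision and recall are read as shown in the comments. These ratios are one plausible reading of the categories, not necessarily the metrics reported in the paper, and `summarize_labels` is a hypothetical helper.

```python
def summarize_labels(labels):
    """Compute per-project precision and recall against FlakeFlagger.

    Assumed schema (may differ from the real manual_label.json):
    {"project": {"TTP": [...], "FTP": [...], "FN": [...], "FP": [...]}}
    """
    summary = {}
    for project, entry in labels.items():
        ttp = len(entry.get("TTP", []))
        ftp = len(entry.get("FTP", []))
        fn = len(entry.get("FN", []))
        fp = len(entry.get("FP", []))
        reported = ttp + ftp + fp  # every test the tool reported
        known = ttp + fn           # flaky tests listed in FlakeFlagger
        summary[project] = {
            # share of reported tests that were confirmed flaky
            "precision": (ttp + ftp) / reported if reported else None,
            # share of FlakeFlagger's flaky tests that the tool reported
            "recall_vs_flakeflagger": ttp / known if known else None,
        }
    return summary


# Hypothetical data for illustration only; not taken from the artifact.
example = {
    "demo-project": {
        "TTP": ["t1", "t2", "t3"],
        "FTP": ["t4"],
        "FN": ["t5"],
        "FP": ["t6"],
    }
}
print(summarize_labels(example))
# With the real file: import json; summarize_labels(json.load(open("manual_label.json")))
```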
